ABSTRACT


One  of  the  most  serious  deterrents  to  the  development  of  multiple  processor 
architectures  has  been  the  problem  of  providing  adequate  communication  between  the 
discrete  processing  elements.  This  paper  examines  two  communications-based 
constraints. 


The  first  constraint  is  related  to  the  physical  structure  of  the  VLSI  chip.  The 
wider  the  communication  path  the  more  pins  are  needed  to  effect  the  data  transfer.  As 
Integrated  Circuits  grow  in  computational  power,  more  communication  capacity  is 
needed,  pushing  designs  closer  to  the  pin  limitations  of  the  packaging  technology. 

The second constraint, somewhat related to the first, is the limited speed with
which data can be transmitted via internal channels. Typical speeds one can achieve
on  a  single  wire  are  on  the  order  of  1  Gbps.  The  recent  development  of  an 
Optoelectronic  Multiplexer  may  allow  VLSI  chips  to  communicate  at  rates  up  to  7 
Gbps.  An  architecture  for  a  parallel  processing  computer  which  takes  advantage  of 
this  new  capability  is  presented.  The  feasibility  of  a  single-chip  parallel-processor 
based  on  the  Optoelectronic  Multiplexer  is  examined  by  projecting  current  trends  in 
processor  speed,  power,  and  transistor  count  into  estimates  of  throughput  for  a 
multi-processor  IC. 


I.  INTRODUCTION 


Farmers  once  used  oxen  to  plow  their  fields.  And  when  the  task  got  too  big  for 
one  ox  they  did  not  try  to  grow  a  bigger  ox.  They  got  two  of  them!  [Ref.  1] 


A.  THE  NEED  FOR  PARALLEL  PROCESSING 

So  too  have  we  often  found  that  one  computer  is  not  enough,  or  at  least,  not  fast 
enough  for  many  applications.  While  progress  on  producing  faster  single  processor 
computers  continues,  it  is  the  orders  of  magnitude  leap  in  speed  possible  in 
multiple-processor computers that promises to lead computing into its Fifth
Generation. 


[Multiple-processor  computers  became]  necessary  because  a  limit  to  higher  speed 
had  been  reached  with  brute-force  approaches  employing  faster  switching  devices. 
Faster components made with gallium arsenide or Josephson junction devices can
increase computer speed only 10 times if current uniprocessor architectures are
used; however with the new architectures, there is hope of increasing speed 100 to
1000  times.  [Ref.  2] 

Such  dramatic  increases  in  computer  speed  would  be  of  great  benefit  to 
researchers  working  on  computationally-intensive  and/or  real  time  problems  such  as 
adaptive  antenna  control,  weather  prediction,  or  fusion  reactor  design.  It  is  not  merely 
a question of having the answers in seconds instead of minutes: once machines can
perform calculations in real time, whole new applications suddenly become possible.

As  an  example,  consider  a  computer  system  which  calculates  the  power  spectral 
density  of  intercepted  radar  emitters.  A  system  which  takes  an  hour  to  analyze  a  few 
seconds' worth of data may be useful to compile electronic intelligence data back at
fleet headquarters, but it produces answers long after the event is over. However, if the
system could perform its analysis in real time it could be used onboard ship or in an
aircraft to recognize hostile missile seekers and dispense chaff or activate jammers, that
is, to respond to events as they happen. Increased speed alone could make this new
application  possible. 


B.  PARALLEL  PROCESSORS  DEPEND  ON  COMMUNICATION 

When using a number of processors on a single problem, the exchange of data
between processors becomes a critical bottleneck. [Ref. 3]

Extensive  research  has  already  been  conducted  in  many  areas  related  to  parallel 
processing, such as task distribution and software development. The research reported
in  this  paper  focused  on  the  architecture  of  parallel-processing  systems,  especially  with 
regard  to  inter-processor  communications. 

A  system  which  uses  more  than  one  processor  to  perform  a  task  must  provide 
communication paths between the processors. There are essentially two approaches to
this requirement:

•  provide a path from every processor to every other processor ("exhaustive"
communications)

•  provide paths between each processor and only some of the other
processors ("limited" communications).


Figure 1.1 Exhaustive Communications.

1.  Exhaustive  Communications 

An  exhaustive  communication  architecture  (Figure  1.1)  provides  direct  data 
exchange  without  bus  contention  or  waiting.  However,  as  the  number  of  processors 
rises, the number of communication paths in an exhaustive architecture becomes
impractically large, leading to high costs. In addition, expansion of the network may
be limited by the inability of the existing processors to accept another communication
port. These difficulties with exhaustive communication architectures have led many
researchers to consider architectures based on limited communications.
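The growth in path count can be illustrated with a short calculation. The sketch below is not part of the original study; it assumes one bidirectional path per pair of processors, and the function names are hypothetical.

```python
def exhaustive_paths(n: int) -> int:
    """Bidirectional paths needed to link every pair of n processors."""
    return n * (n - 1) // 2


def ports_per_processor(n: int) -> int:
    """Communication ports each processor must provide in an exhaustive network."""
    return n - 1


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, exhaustive_paths(n), ports_per_processor(n))
```

Under this assumption the path count grows roughly with the square of the number of processors, while every added processor also demands one more port on each existing processor.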

2.  Limited  Communications 

In limited communication architectures, [Ref. 4] identifies two major groups:
dedicated  path  and  shared  path  structures.  Limited  architectures  employing  dedicated 
paths  enable  a  processor  to  exchange  data  without  bus  contention  or  waiting,  but  only 
with  a  limited  number  of  processors.  Figures  1.2  and  1.3  show  two  examples  of  a 
limited  communication  architecture  employing  dedicated  paths. 


Figure 1.2 Limited Communications-Dedicated Path Loop.

Parallel-computing  systems  built  around  a  limited  communications-dedicated 
path  concept  can  take  advantage  of  the  immediate  communication  between  a  given 
processor and the processors adjacent to it. Yet if a problem requires communication
between  non-adjacent  processors,  the  message  must  be  passed  along  by  all  the 
intermediate  processors.  Should  the  message  reach  a  busy  node,  it  may  be  delayed  or 
even  discarded,  forcing  a  re-transmission.  The  resultant  communication  overhead 
could  tie  up  the  system  and  severely  slow  its  operation. 
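The relaying burden can be sketched for the loop of Figure 1.2. This is an illustration only: it assumes a bidirectional loop of n processors numbered 0 to n-1 in which a message travels the shorter way around, and the function names are hypothetical.

```python
def loop_hops(src: int, dst: int, n: int) -> int:
    """Links traversed between two processors on a bidirectional loop of n nodes."""
    clockwise = (dst - src) % n
    return min(clockwise, n - clockwise)


def intermediate_relays(src: int, dst: int, n: int) -> int:
    """Processors that must pass the message along (excluding source and destination)."""
    return max(loop_hops(src, dst, n) - 1, 0)


if __name__ == "__main__":
    n = 8
    for dst in range(1, n):
        print(f"0 -> {dst}: {intermediate_relays(0, dst, n)} relaying processors")
```

Adjacent processors need no relays, while the worst case on the loop ties up about half of the remaining processors' links for a single message.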


Figure 1.3 Limited Communications-Dedicated Path Regular Network.


Using  a  shared  path  (as  in  Figure  1.4)  eliminates  the  need  to  relay  data  from 
one  processor  to  another,  because  an  uninterrupted  path  already  exists  between  any 
two  processors.  For  this  reason,  limited  shared-path  architectures  are  more  flexible  in 
the  kinds  of  data  flows  which  can  be  achieved  and  in  the  types  of  problems  which  can 
be  solved  than  limited  dedicated-path  architectures.  However,  because  processors  must 
wait their turn to use the common communication path, system throughput may suffer
unless the common bus runs at such a high speed that the processors can barely keep
up with it. Such a high-speed bus design would require a multiplexer
on  each  chip  capable  of  speeds  considerably  in  excess  of  the  speeds  associated  with 
conventional  multiplexers.  The  Optoelectronic  Multiplexer  (OM)  developed  by  the 
Naval  Ocean  Systems  Center,  San  Diego,  is  such  a  device. 


C.  THE  OPTOELECTRONIC  MULTIPLEXER  CONCEPT 
1.  Optical  Switching  Yields  High  Speed 

Figure 1.4 Limited Communications-Shared Path.

The Optoelectronic Multiplexer employs optically-activated junctions to
sequentially link parallel data lines onto a serial bus. [Ref. 5] A laser pulse, fed to the
junction by optical fiber, activates the junction, allowing conduction from the input
line onto the main data transmission line. Because a different length of optical fiber
feeds each junction, the laser pulses arrive at the junctions at different times.
Consequently, the junctions are activated one at a time, which converts the parallel
data waiting on the input lines to serial data pulses travelling along the output
transmission line. The short pulsewidths generated by the laser allow extremely high
pulse repetition frequencies; researchers have tested a prototype laser multiplexer at
speeds as high as 7 Gbps. [Ref. 5]
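The timing idea behind the multiplexer can be sketched numerically. The following is a simplified model, not the device described in Ref. 5: it assumes a fiber group index of 1.5 (a typical figure for silica fiber, chosen here for illustration), equal fiber-length steps between successive junctions, and hypothetical function names.

```python
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum


def fiber_length_step(bit_rate_bps: float, group_index: float = 1.5) -> float:
    """Extra fiber length (metres) that delays a laser pulse by one bit period."""
    bit_period_s = 1.0 / bit_rate_bps
    return (C_VACUUM_M_PER_S / group_index) * bit_period_s


def serialize(parallel_bits, bit_rate_bps: float, group_index: float = 1.5):
    """Return (arrival_time_s, bit) pairs in the order the junctions fire."""
    velocity = C_VACUUM_M_PER_S / group_index
    step_m = fiber_length_step(bit_rate_bps, group_index)
    # Junction k is fed through k extra steps of fiber, so it fires k bit periods late.
    slots = [(k * step_m / velocity, bit) for k, bit in enumerate(parallel_bits)]
    return sorted(slots, key=lambda slot: slot[0])


if __name__ == "__main__":
    print(f"{fiber_length_step(7e9) * 100:.2f} cm of fiber per bit slot at 7 Gbps")
    for arrival, bit in serialize([1, 0, 1, 1], 7e9):
        print(f"{arrival * 1e12:7.1f} ps -> {bit}")
```

Under these assumptions, each successive junction needs only a few centimetres of additional fiber, which suggests why passive delay lines are a practical way to sequence the junctions.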

2. A Suitable Architecture Sought

Current  research  [Refs.  6  -  10]  is  especially  rich  in  parallel-processing 
architectures  based  on  limited  communication  dedicated-path  concepts,  because  shared 
path  communications  typically  involve  delays  which  could  detract  from  the  high 
performance  otherwise  achievable  by  parallel-processing  designs.  Prompted  by  the 
development  of  the  high-speed  Optoelectronic  Multiplexer,  which  promises  an  increase 
in  serial  communication  speed  of  at  least  one  and  perhaps  two  orders  of  magnitude, 
this project evaluated the impact of using a shared bus and serial communication in a
parallel  processing  computer  architecture.  Specifically,  the  following  questions  were 
posed:  With  current  technology,  is  it  feasible  to  fabricate  an  Optoelectronic 
Multiplexer-based  multiple  processor  chip?  What  new  architectures  are  made  possible 
by  the  OM's  high  speed?  Which  architecture  makes  optimum  use  of  this  new 
capability? 
Figure 1.5 Optoelectronic Multiplexer Block Diagram.
Four conditions would have to be met in order for a single-chip OM-based
parallel processor to be feasible:

•  IC manufacturing technology should be able to fabricate enough transistors on a
single chip to create a multi-processor chip.

•  A large chip partitioned into many processors would produce higher throughput
than the same chip fabricated as a large uniprocessor.

•  Chip throughput (measured in bits per second) would exceed the capacity of
conventional multiplexers, justifying the use of the OM.

•  The package of such a multiple processor chip would require so many pins that
package size would be excessive and a multiplexer would be used instead.

The first condition is easily dealt with by a specific example. The Intel 8080
microprocessor contained about 4500 transistors [Ref. 12], while Motorola's MC68020
contains about 200000 [Ref. 13]. Using the technology of the Motorola MC68020, one
could produce a chip with over 40 Intel 8080s. Clearly, manufacturers can already
fabricate a multiple-processor chip. The remaining points require further discussion
and  are  covered  in  Chapters  II  and  III. 
II. OPTIMUM ARCHITECTURE OF LARGE INTEGRATED CIRCUITS


Chapter I's demonstration that a multiple-processor chip could be fabricated
prompts the following questions:


•  Is a multiprocessor chip the best use of IC fabrication technology, or should all
available transistors be assembled into a single processor?


A. PARTITIONING SILICON FOR MAXIMUM THROUGHPUT

Should  designers  divide  the  available  silicon  among  a  few  large  and  capable 
processors  or  among  many,  less  capable  processors?  Which  mix  yields  the  highest 
throughput? 

Consider  a  system  of  N  processors,  each  executing  the  same  program  and 
producing  the  same  number  of  output  data  words  each  second.  Applications  of  such 
architectures  abound  in  the  field  of  real  time  signal  processing,  which  uses  regularly 
structured  algorithms.  As  N  increases,  processors  share  the  load,  so  each  may  run 
more  slowly  without  changing  the  speed  of  the  system.  If  we  imagine  a  system 
throughput  goal  of  R  bits  per  second  (bps),  then: 

$R = N S$    (eqn 2.1)


where $R$ = System throughput (bps)
$N$ = Number of processors
$S$ = Throughput of each processor (bps).


$S_{req'd} = R N^{-1}$    (eqn 2.2)

where $S_{req'd}$ = Speed required of each processor
in order to meet the system goal of R bps.

These equations describe what is required of a processor, but how does a
processor's  actual  performance  vary  with  N?  At  issue  is  the  apportionment  of  the 
entire  chip's  allotment  of  transistors  and  heat  dissipation  ability  among  N  processors. 
1. Transistor Constraints

Assuming we can put only so many devices on a chip, then:

$t = T N^{-1}$    (eqn 2.3)

where $t$ = complexity of any processor, measured in transistors
$N$ = number of processors
$T$ = Total number of transistors on chip

Generally,  a  complex  processor  will  be  able  to  perform  a  given  calculation 
faster  than  a  simple  processor.  For  example,  a  microprocessor  with  an  on-board 
floating-point  unit  can  handle  a  multiplication  in  a  few  clock  cycles,  while  a  smaller 
processor  has  to  do  tedious  successive  additions,  requiring  much  more  time.  But  what 
is  the  exact  relationship  between  processor  complexity  and  speed?  To  answer  this  we 
shall  examine  the  specifications  of  some  existing  processors,  as  listed  in  Table  I  and 
graphed  in  Figure  2.1. 


TABLE I

SPECIFICATIONS OF SOME ACTUAL PROCESSORS

Groups: CPUs 1982-85; FPUs
Columns: Reference; Data Word (bits); Time Required for Multiplication (10^-6 sec);
Bit Rate (10^6 sec^-1); Transistor Count (thousands)


Figure 2.1 Processor Speed and Complexity (Experimental).
From the experimental relationships between processor speed and complexity
shown in Figure 2.1, we can see that the data in each group are approximated by the
equation:

$S_{proc} = A t^{a}$    (eqn 2.4)


where $S_{proc}$ = processor speed (in bps throughput)

$t$ = processor complexity (in number of transistors)

$A$ = empirical constant of proportionality given in Table II
$a$ = empirical constant given in Table II.


TABLE  II 

EXPERIMENTAL  CONSTANTS 


Group 


Equation 2.4 describes how, in some typical one-processor systems, processor
speed is related to complexity. To apply these findings to an N-processor system of T
transistors, we combine equations 2.3 and 2.4:

$S_{proc} = A (T N^{-1})^{a}$    (eqn 2.5)

$S_{proc} = A T^{a} N^{-a}$

$S_{proc} = K_1 N^{-a}$

where $S_{proc}$ = processor speed (in bps throughput)

$t$ = processor complexity (in number of transistors)

$N$ = number of processors
$K_1 = A T^{a}$, a constant for a given chip
$A$ and $a$ are constants given in Table II.
A family of "processor curves" may be used to describe the tradeoff between
individual processor speed and the number of processors, constrained by a constant
number of transistors. The tradeoffs are shown in Figure 2.2. For example, consider
the curve labeled "CPU 82-85," which is based on a constant 10^6 transistors per chip.
If these transistors are divided into 10 processors of 10^5 transistors each, Equation 2.4
predicts that each will produce about 116 x 10^6 bps of output. But if the chip is
divided into more (for example 25) processors of 4 x 10^4 transistors each, then these
less complex processors will be capable of only about 17.3 x 10^6 bps each.

When we superimpose these processor curves (Figure 2.2) with a family of
"system" curves, generated by choosing several values of "R" in Equation 2.2, the result
(Figures 2.3 and 2.4) yields a strategy for choosing N. Where the processor curve
(describing what the processor can do) intersects the system curve (describing what
each processor must do) determines the number of processors (N) into which the chip
should be divided to yield that particular level of system throughput. For example, to
achieve a system throughput of 10^9 bps, Figure 2.3 shows the chip should be divided
into about 12 processors (point A). Yet choosing to partition the silicon into fewer,
larger processors (point B) yields a higher system throughput of 2 x 10^9 bps.
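The intersection described above can be reproduced numerically. The sketch below is an illustration under stated assumptions: because the constants for the "CPU 82-85" curve are not restated here, it infers A and a from the two operating points quoted in the text (10 processors at about 116 x 10^6 bps and 25 processors at about 17.3 x 10^6 bps on a chip of 10^6 transistors). The inferred values and the function names are therefore hypothetical and may differ slightly from those in Table II.

```python
import math


def fit_power_law(t1: float, s1: float, t2: float, s2: float):
    """Fit S = A * t**a through two (complexity, speed) points."""
    a = math.log(s1 / s2) / math.log(t1 / t2)
    A = s1 / t1 ** a
    return A, a


def processor_speed(A: float, a: float, T: float, N: float) -> float:
    """Equation 2.5: speed of each of N processors sharing T transistors."""
    return A * (T / N) ** a


def processors_for_throughput(A: float, a: float, T: float, R: float) -> float:
    """N at which the processor curve (eqn 2.5) meets the system curve (eqn 2.2).

    Setting A*(T/N)**a = R/N gives N**(1 - a) = R / (A * T**a).
    """
    if a == 1.0:
        raise ValueError("curves are parallel when a == 1; no unique intersection")
    k1 = A * T ** a
    return (R / k1) ** (1.0 / (1.0 - a))


if __name__ == "__main__":
    T = 1e6
    A, a = fit_power_law(1e5, 116e6, 4e4, 17.3e6)
    print(f"inferred a = {a:.3f}")
    for R in (1e9, 2e9):
        n = processors_for_throughput(A, a, T, R)
        print(f"R = {R:.0e} bps -> N = {n:.1f}")
```

With the inferred constants, the intersection for 10^9 bps falls between 11 and 12 processors, in line with point A, and the 2 x 10^9 bps curve is met with roughly half as many processors, in line with point B.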

In general, when processor speed is a strong function of complexity, that is
when:

$S_{proc} = A t^{a}$ with $a > 1$    (eqn 2.6)

then $S_{proc}$ is proportional to $N^{-a}$ ($a > 1$) while $S_{req'd}$ is proportional to $N^{-1}$. Thus,
$S_{proc}$ falls faster than $S_{req'd}$ as N increases. In this case, the highest performance will
always result from choosing the lowest N possible, in other words N = 1. This strategy
may be constrained for very large values of T: there may not be a processor design
which can effectively use 10^7 transistors, for example. Also, the optimistic relationship
of Equation 2.6 may not hold for large values of t.
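One way to see why the choice of N depends only on whether a is above or below 1 is to write the throughput the chip can actually deliver as a function of N. The symbol $R_{ach}$ (achievable system throughput) is introduced here for illustration and does not appear in the original equations; $K_1 = A T^{a}$ follows from Equation 2.5.

```latex
\begin{align}
R_{ach}(N) &= N\, S_{proc} = N \cdot K_1 N^{-a} = K_1 N^{1-a}, \\
\frac{dR_{ach}}{dN} &= (1-a)\, K_1 N^{-a}.
\end{align}
```

Since $K_1 > 0$, the derivative is negative for $a > 1$ (throughput falls as the chip is divided further, so N = 1 is best) and positive for $a < 1$ (throughput rises with N, so the chip should be divided as finely as practical).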

Figure 2.2 Processor Speed and Number of Processors Based on a Constant Number of Transistors.

Figure 2.3 Relationship Between Processor Capability and System Requirements
(Speed a Strong Function of Complexity).

Figure 2.4 Relationship Between Processor Capability and System Requirements
(Speed a Weak Function of Complexity).

On the other hand, when a weak relationship exists between speed and
complexity, as shown in Figure 2.4, the best strategy is to select N as large as possible.
As before, however, there are limits to this rule. It may be impractical to divide the
computational task beyond a certain point. For example, a 256-point FFT probably
cannot be efficiently shared by more than 128 x 8 = 1024 processors.¹ Also, as N
increases and t decreases, processors will eventually become too simple to function as
microprocessors. For example, excessive reduction in processor complexity could yield
a circuit unable to retain a data word or perform a basic calculation.
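The processor limit quoted for the 256-point FFT follows from counting two-point butterflies, as the footnote indicates. A minimal sketch of that count, assuming a radix-2 FFT with one processor per butterfly and no reuse between stages (the function name is hypothetical):

```python
def fft_butterfly_processors(points: int):
    """Return (processors per stage, stages, total) for a radix-2 FFT.

    Assumes one processor per two-point butterfly and a fresh bank per stage.
    """
    if points < 2 or points & (points - 1):
        raise ValueError("points must be a power of two, at least 2")
    per_stage = points // 2
    stages = points.bit_length() - 1  # log2(points) for a power of two
    return per_stage, stages, per_stage * stages


if __name__ == "__main__":
    print(fft_butterfly_processors(256))
```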

2.  Power  Constraints 

Each  chip  can  only  dissipate  a  given  amount  of  heat.  The  power  available  to 
any  individual  processor  is: 

$p = P N^{-1}$    (eqn 2.7)

where $p$ = Power available to any one processor
$N$ = number of processors
$P$ = Total power available to the chip


TABLE III

SPECIFICATIONS OF SOME ACTUAL PROCESSORS

Group      Reference  Data Word  Time Required for  Bit Rate       Power
                      (bits)     Multiplication     (10^6 sec^-1)  (watts)
                                 (10^-6 sec)

NMOS CPUs  Ref. 14    32         8.30                 3.86         1.50
           Ref. 15    16         6.25                 2.56         1.40
           Ref. 16    32         1.80                17.8          7.00
           Ref. 17    32         5.50                 5.82         0.75
           Ref. 18    32         4.50                 7.11         2.00
           Ref. 19    32         2.70                11.85         2.10

CMOS FPUs  Ref. 23    16         .090               178.           0.125
           Ref. 25    16         .060               267.           0.065
           Ref. 26    16         .045               356.           0.100
           Ref. 27    16         .027               593.           0.15
           Ref. 28    16         .079               405.           0.195
           Ref. 29    32         .100               320.           0.40
           Ref. 30    16         .065               246.           0.200
           Ref. 31    32         .100               320.           0.500
           Ref. 32    16         .065               246.           0.10
           Ref. 33    16         .130               123.           0.275

¹ There are 256 ÷ 2 = 128 processors per stage and log2(256) = 8 stages.
When the relationship between processor speed and power is examined in the light of
data from actual processors (Table III and Figure 2.5), no clear trend is evident.
In particular, there is a great deal of scatter in the CMOS multiplier
chip data. This may be due to differences in the way researchers report power
dissipation data; for example, some may report only the power consumption of the
computational segment, while others report the power used by the entire chip,
including bus drivers. In spite of these limitations, one interpretation of the
power/throughput data is:

$S_{proc} = B p^{b}$    (eqn 2.8)

where $S_{proc}$ = processor speed (in bps throughput)
$p$ = processor power (in watts)

$B$ and $b$ are empirical constants given in Table IV.

Therefore, combining equations 2.7 and 2.8 as before:

$S_{proc} = B (P N^{-1})^{b}$    (eqn 2.9)

$S_{proc} = B P^{b} N^{-b}$

$S_{proc} = K_2 N^{-b}$

where $S_{proc}$ = processor speed (in bps throughput)
$p$ = processor power (in watts)

$N$ = number of processors

$K_2 = B P^{b}$, a constant for a given chip

$B$ and $b$ are empirical constants given in Table IV.

Figure  2.6  shows  the  relationship  described  in  equation  2.9,  namely,  the 
tradeoff  of  individual  processor  speed  against  the  number  of  processors,  constrained 
this  time  by  a  constant  power  level,  as  required  by  equation  2.7.  Since,  for  the  group 
of  actual  processors  examined, 

$S_{proc} = B p^{b}$ with $b < 1$    (eqn 2.10)

Figure  2.7  shows  that  the  best  strategy  is  to  select  N  as  large  as  possible. 
Figure 2.5 Processor Speed and Power (Experimental).


TABLE IV

EXPERIMENTAL CONSTANTS

B.  MINIMUM  CHIP  SIZE  FOR  OM  APPLICATION 

How  large  (in  terms  of  transistor  count,  heat  dissipation,  and  number  of 
processors)  would  a  chip  have  to  be  in  order  to  produce  sufficient  throughput  to  justify 
the use of the Optoelectronic Multiplexer?

1.  Minimum  Transistor  Count 

Assuming  the  individual  processors  are  of  low  complexity  (like  the  FPU  group 
of  Figure  2.1)  implies  that: 

$S_{proc} = A t^{a}$ with $a < 1$    (eqn 2.11)

where $S_{proc}$ = processor speed (in bps throughput)

$t$ = processor complexity (in number of transistors)

$A$ = empirical constant of proportionality given in Table II
$a$ = empirical constant, here < 1.

For this group, the discussion in the previous section shows that the
maximum throughput is achieved by partitioning the available silicon into the largest
number of processors possible, limited by the minimum complexity of the simplest
processor design.² Therefore:

$N_{max} = T t_{min}^{-1}$    (eqn 2.12)

where $T$ = total number of transistors on chip

$t_{min}$ = complexity of the simplest processor design, measured in transistors

$N_{max}$ = number of simple processors possible on chip of T transistors

² While the components of systolic arrays are less complex than the assumed
simplest processor, this research did not study the performance of such
ICs; accordingly they are not considered here.


Figure 2.6 Processor Speed and Number of Processors Based on a Constant Chip Power Level.

Figure 2.7 Relationship Between Processor Capability and System Requirements
(Speed a Weak Function of Power).

Since each processor produces an output of $S_{proc}$ bits per second and there
are $N_{max}$ processors, the system throughput is:

$S_{sysmax} = N_{max} S_{proc}$    (eqn 2.13)

$S_{sysmax} = N_{max} A t_{min}^{a}$    (eqn 2.14)

$S_{sysmax} = (T t_{min}^{-1}) A t_{min}^{a}$    (eqn 2.15)

$S_{sysmax} = T A t_{min}^{a-1}$    (eqn 2.16)

Defining $S_{OM}$ to be the minimum system throughput for which use of the OM
is justified leads to:

$S_{OM} = T_{min} A t_{min}^{a-1}$    (eqn 2.17)

$T_{min} = S_{OM} t_{min}^{1-a} A^{-1}$    (eqn 2.18)

where $T_{min}$ = minimum number of transistors on chip for OM usage to be justified

To estimate the value of $T_{min}$ assume:

$t_{min} \approx 7 \times 10^{3}$ transistors (lower end of FPU group in Table I)

$A = 4.22 \times 10^{5}$ (Table II)
$a = 0.711$ (Table II)

$S_{OM} = 3 \times 10^{9}$ bps (Currently the upper range of
conventional multiplexers.) [Refs. 34, 35, 36, 37]
Therefore:


$T_{min} \approx 92000$ transistors
$N \approx 13$ processors

Thus, since processors with transistor counts > $T_{min}$ are already in existence
[Ref. 21], it seems that an OM-based single chip multiple processor is feasible with
respect to the number of transistors required.

2.  Minimum  Power  Dissipation 

What  is  the  minimum  heat  dissipation  of  a  multi-processor  chip  which  would 
yield  throughput  in  the  OM  range? 

$N_{max} = P p_{min}^{-1}$    (eqn 2.19)


where $P$ = Total power dissipation of the chip (watts)

$p_{min}$ = power used by the simplest processor design, measured in watts
$N_{max}$ = number of simple processors possible on chip of P watts


$S_{sysmax} = N_{max} S_{proc}$    (eqn 2.20)


Substituting from Equation 2.8,


$S_{sysmax} = N_{max} B p_{min}^{b}$    (eqn 2.21)


And, substituting for N from Equation 2.19,


$S_{sysmax} = [P p_{min}^{-1}] B p_{min}^{b}$    (eqn 2.22)


$S_{sysmax} = P B p_{min}^{b-1}$    (eqn 2.23)
Defining $S_{OM}$ to be the minimum system throughput for which use of the OM
is justified leads to:

$S_{OM} = P_{min} B p_{min}^{b-1}$    (eqn 2.24)


$P_{min} = S_{OM} p_{min}^{1-b} B^{-1}$    (eqn 2.25)

where $P_{min}$ = minimum power dissipation of the chip for OM usage to be justified

To estimate the value of $P_{min}$ assume:

$p_{min} = 0.10$ watts (lower end of CMOS FPU group in Table III)

$B = 3.43 \times 10^{8}$ (Table IV)
$b = 0.099$ (Table IV)

$S_{OM} = 3 \times 10^{9}$ bps

Therefore:

$P_{min} \approx 1.10$ watts

$N \approx 11$ processors
This  power  level  is  quite  reasonable,  and  it  would  seem  that  from  the 
standpoint  of  heat  dissipation  an  OM-based  multiple  processor  chip  is  feasible. 
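Equations 2.18 and 2.25 share the same form, so both estimates can be checked with one small function. This is a sketch rather than the original calculation: the function name is hypothetical, the constants are those quoted in the text, and the simplest-processor complexity is taken as 7 x 10^3 transistors, which is an assumption chosen because it reproduces the quoted results of about 92000 transistors and 13 processors.

```python
def minimum_chip_total(s_om: float, coeff: float, exponent: float, unit_min: float) -> float:
    """Smallest chip total X (transistors or watts) that reaches throughput s_om.

    Solves s_om = X * coeff * unit_min**(exponent - 1), the common form of
    equations 2.17 and 2.24, for X.
    """
    return s_om * unit_min ** (1.0 - exponent) / coeff


S_OM = 3e9  # bps, upper range of conventional multiplexers per the text

# Transistor constraint (eqn 2.18); t_min = 7e3 is an assumed reading.
T_MIN_UNIT = 7e3
t_min_total = minimum_chip_total(S_OM, coeff=4.22e5, exponent=0.711, unit_min=T_MIN_UNIT)

# Power constraint (eqn 2.25).
P_MIN_UNIT = 0.10
p_min_total = minimum_chip_total(S_OM, coeff=3.43e8, exponent=0.099, unit_min=P_MIN_UNIT)

if __name__ == "__main__":
    print(f"T_min = {t_min_total:,.0f} transistors, N = {t_min_total / T_MIN_UNIT:.1f}")
    print(f"P_min = {p_min_total:.2f} watts, N = {p_min_total / P_MIN_UNIT:.1f}")
```

Both constraints point to a chip of roughly a dozen simple processors, which is why the two feasibility conclusions agree.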
III. THE NEED FOR A HIGH-SPEED MULTIPLEXER


Chapter  II  demonstrated  that  current  technology  could  produce  a  chip  whose 
throughput  would  exceed  the  capacity  of  conventional  multiplexer  technology.  But 
why  consider  serial  communications  and  multiplexers  at  all?  Why  not  exchange  data 
with  the  chip  in  parallel  via  pins  or  leads? 

A.  PROCESSOR  POWER  LIMITED  BY  COMMUNICATION  PATH 

We  have  seen  that  future  high-density  IC's  may  be  optimally  structured  as  a 
bank  of  many  processors,  each  of  moderate  capability.  However,  even  if 
manufacturers  can  achieve  sufficient  circuit  density  to  fabricate  a  multi-processor  chip, 
such  a  device  might  not  be  practical  due  to  the  large  number  of  leads  needed  to 
communicate  with  each  processor  from  off-chip.  For  example,  imagine  an  N-processor 
IC  designed  to  compute  a  2N-point  Fast  Fourier  Transform  (FFT).  During  the 
computation,  the  IC  must  read  in,  then  write  out,  2N  complex  output  words,  or  4N 
real  words.  Assuming  a  40  bit  word  size,  and  using  the  same  pins  for  input  and 
output,  we  can  see  this  IC  would  need: 


$\frac{40\ \text{leads}}{\text{word}} \times [4N\ \text{words}] = 160N\ \text{leads}$    (eqn 3.1)


How  large  a  package  will  we  need  to  handle  all  these  leads?  Using  a  Pin-Grid 
Array  (PGA)  package  with  pins  spaced  every  0.1  inch,  the  area  of  the  package  is: 


$\text{Area} = \frac{160N\ \text{leads}}{\left[\frac{10\ \text{leads}}{25.4\ \text{mm}}\right]^{2}} = 1032N\ \text{mm}^{2}$    (eqn 3.2)
For illustrative purposes we can estimate the area of the silicon chip in this
package by assuming the chip size of the processor is approximately the same as that
of the processor recently reported by the Matsushita Corporation of Osaka. [Ref. 28]
Their processor performs a 32 bit floating point multiplication in about 75 nsec and is
32.6 mm^2 in area. A chip containing N of these processors would occupy about 32.6N
mm^2 of silicon. Thus, the ratio of silicon area to package area in our hypothetical IC
is:


$\frac{\text{Silicon Area}}{\text{Package Area}} = \frac{32.6N}{1032N} = 3.2\ \%$    (eqn 3.3)
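The pin-count argument of equations 3.1 through 3.3 can be restated as a short calculation. The sketch below uses the figures given in the text (40-bit words, 4N words, 0.1 inch pin spacing, 32.6 mm^2 per processor); the function names and default parameters are hypothetical.

```python
MM_PER_INCH = 25.4


def leads_required(n_processors: int, bits_per_word: int = 40,
                   words_per_processor: int = 4) -> int:
    """Equation 3.1: one lead per bit of every word exchanged in parallel."""
    return bits_per_word * words_per_processor * n_processors


def package_area_mm2(leads: int, pin_pitch_inch: float = 0.1) -> float:
    """Equation 3.2: pin-grid array area, one pin per pitch-by-pitch square."""
    pitch_mm = pin_pitch_inch * MM_PER_INCH
    return leads * pitch_mm ** 2


def silicon_to_package_ratio(n_processors: int, die_area_mm2: float = 32.6) -> float:
    """Equation 3.3: fraction of the package footprint occupied by silicon."""
    package = package_area_mm2(leads_required(n_processors))
    return n_processors * die_area_mm2 / package


if __name__ == "__main__":
    for n in (1, 8, 16):
        print(n, leads_required(n), round(package_area_mm2(leads_required(n))),
              f"{silicon_to_package_ratio(n):.1%}")
```

Because both silicon area and package area scale with N, the ratio is the same for any N at a fixed circuit density.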


As  IC  fabrication  technology  improves,  this  waste  of  space  gets  even  worse.  A 
new  production  technique  enabling  manufacturers  to  produce  circuits  in  half  the  silicon 
area  previously  required  would  permit  us  to  double  "N"  without  increasing  the  silicon 
area.  Yet  package  area  would  double,  due  to  increased  pinout  requirements.  Once 
some  maximum  package  size  is  reached,  further  improvements  in  circuit  density  do  us 
no good: we simply cannot communicate with more processors. As one researcher
stated,  "the  technology  has  become  increasingly  constrained  by  packaging  limitations" 
[Ref.  38]. 

Increasing  lead  density  will  produce  some  relief  from  this  communication  limit, 
but  can  not  be  pursued  beyond  some  maximum  without  excessive  fabrication  cost.  We 
are faced, then, with some maximum package size and maximum lead density, implying
an  eventual  limit  on  the  number  of  leads  a  single  IC  can  have. 

Given  this  eventual  limit  on  the  number  of  simultaneous  off-chip  communication 
paths,  Rent's  Rule  [Ref.  12:p.  235] 


$P = 4 G^{0.6}$    (eqn 3.4)

where $P$ = Number of chip pads or leads
$G$ = Number of gates on the chip

would  seem  to  imply  that  if  the  number  of  paths  (P)  is  limited,  then  so  is  the  number 
of  gates  (G)  and,  therefore  microprocessor  complexity  and  computational  power. 
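The implication can be made concrete by inverting Rent's Rule for a fixed lead budget. The sketch below is illustrative only: the coefficient 4 and exponent 0.6 are taken as read from Equation 3.4 and are exposed as parameters because published Rent constants vary with the design style, and the function names are hypothetical.

```python
def pads_for_gates(gates: float, k: float = 4.0, exponent: float = 0.6) -> float:
    """Rent's Rule: leads needed by a chip of the given gate count."""
    return k * gates ** exponent


def gates_for_pads(pads: float, k: float = 4.0, exponent: float = 0.6) -> float:
    """Inverse of Rent's Rule: gate count supportable by a fixed number of leads."""
    return (pads / k) ** (1.0 / exponent)


if __name__ == "__main__":
    for pads in (64, 160, 400):
        print(f"{pads} leads -> about {gates_for_pads(pads):,.0f} gates")
```

Because the exponent is below 1, each doubling of the lead budget more than doubles the supportable gate count, yet a hard ceiling on leads still fixes a hard ceiling on gates.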

This  ultimate  limit  on  non-multiplexed  designs  is  not  precisely  defined.  Neither 
maximum  package  size  nor  maximum  lead  density  have  yet  been  reached,  and  industry 
experts  are  wary  of  predicting  when  they  might  be.  In  addition,  the  switch  to 
multiplexed designs will probably occur over a range of processor densities and
complexities,  influenced  by  market  factors  (there  will  be  few  customers  for  very  large 
packages)  and  manufacturing  realities  (specialized  chip  sizes  mean  more  expensive  chip 
handling  equipment)  as  well  as  the  theoretical  factors  described  above. 

For all these reasons, large ICs composed of multiple processors will require too
many pins to use a conventional parallel-transfer scheme with pins or leads. Instead, a
serial communications link must be considered, and as shown in Chapter II, the speeds
required will exceed the capacity of conventional multiplexers.

IV. SYSTEM ARCHITECTURE BASED ON SERIAL COMMUNICATION


Chapters  II  and  III  demonstrate  that,  in  the  next  generation  of  ICs,  a 
microprocessor  may  very  well  be  organized  as  a  bank  of  smaller  processors,  all  sharing 
a  relatively  few  pins  through  a  high-speed  multiplexer.  But: 

•  What  on-chip  data  flow  architecture  should  be  employed  among  these 
processors? 

•  How  can  a  serial  data  stream  be  distributed  among  N  processors? 

•  What  are  the  detailed  structures  of  the  elements  which  make  up  an  OM-based 
architecture? 

A.  ON-CHIP  DATA  FLOW  ARCHITECTURE 

How should an N-processor chip be organized? The ideal structure will vary with
the application; this discussion considers one specific application: computing FFTs.
The number of processors required to compute a given size FFT will depend on
whether processors are "reused," that is, whether a processor bank's outputs are
shuffled and returned to the same processors (reused) or directed to the next bank of
processors (pipelined). Reusing processors allows a given FFT to be computed with
fewer processors, but takes more time. The architectures associated with both reuse
and pipeline strategies are discussed in the following sections.

1.  Pipeline  Architecture 

Assuming  that  the  throughput  of  the  system  is  to  be  maximized,  there  will  be 
no  "reuse"  of  processors.  That  is: 

•  each  processor  performs  only  a  two  point  FFT  "butterfly" 

•  a  new  bank  of  processors  performs  each  stage  of  the  computation  in  a  pipeline 
strategy. 

The most straightforward architecture for N processors is an N x 1 column.
How would this grouping affect data flow among the processors? As an example,
when the task is a 16-point FFT, the processors must exchange data as shown in
Figures 4.1 and 4.2. Dividing up the 32 processors shown in Figure 4.2 into 4 x 1
chips forces 80 data words to cross chip boundaries during the computation, as shown
in Figure 4.3.

Re-organizing the four processors on each chip into a 2 x 2 matrix (Figure
4.4) results in only 48 words crossing chip boundaries. This lessens off-chip
communication delay, thereby improving the system's throughput, and reduces the
demands on the communications network.

The 2 x 2 structure is more efficient because it is the structure of a four-point
FFT. In a sense, the 2 x 2 structure performs all the computations possible on the
four points it receives, while the 4 x 1 array, receiving eight points, must hand off its
data only partially "chewed."
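The counts of 80 and 48 words can be reproduced with a small model. The sketch below makes three assumptions that are not stated in the source: the FFT is an in-place radix-2 computation in which stage s combines data positions differing in address bit s, each chip holds `rows` butterfly processors for each of `cols` consecutive stages, and every input and output word counts as one boundary crossing. The names `chip_of` and `words_crossing` are illustrative.

```python
def chip_of(stage, pos, rows, cols):
    """Identify the chip that handles data position `pos` during `stage`."""
    block = stage // cols                 # group of `cols` consecutive stages
    lo = block * cols                     # lowest address bit combined in this block
    low_bits = pos & ((1 << lo) - 1)      # address bits below the block's bits
    high_bits = pos >> (lo + cols)        # address bits above the block's bits
    key = (high_bits << lo) | low_bits    # names one 2**cols-point sub-FFT
    subffts_per_chip = rows // (1 << (cols - 1))
    return (block, key // subffts_per_chip)


def words_crossing(points, rows, cols):
    """Count data words that cross a chip boundary in a pipelined FFT."""
    stages = points.bit_length() - 1      # log2(points) for a power of two
    crossing = 2 * points                 # every input and every output word
    for stage in range(stages - 1):
        for pos in range(points):
            if chip_of(stage, pos, rows, cols) != chip_of(stage + 1, pos, rows, cols):
                crossing += 1
    return crossing


if __name__ == "__main__":
    print("4 x 1 chips:", words_crossing(16, rows=4, cols=1))
    print("2 x 2 chips:", words_crossing(16, rows=2, cols=2))
```

In this model a 4 x 1 chip never holds two consecutive stages, so all sixteen words cross at each of the five boundaries, while a 2 x 2 chip keeps every second inter-stage transfer on board.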

There are many such matrices, each corresponding to a particular FFT. For
example, Figure 4.2 suggests that a 32-processor IC designed for FFT
computation would best be configured as an 8 x 4 matrix. In general, the matrix
dimensions are:

2^{n-1} x n, where n = 1, 2, 3, ...  (eqn 4.1)

2.  Reuse  Architecture 

The  number  of  processors  required  by  a  pipeline  architecture  to  compute  a 
P-point  FFT  is: 

(P/2) x log2(P)  (eqn 4.2)
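As a quick check of eqn 4.2 against the sixteen-point example, and of the chip matrix dimensions given in eqn 4.1, the following sketch evaluates both expressions. The function names are illustrative, and the transform size is assumed to be a power of two.

```python
def pipeline_processors(points):
    """Processors needed by a fully pipelined P-point FFT: (P/2) * log2(P)."""
    stages = points.bit_length() - 1      # log2(points) for a power of two
    return (points // 2) * stages


def chip_matrix(n):
    """Chip organization (rows, columns) of eqn 4.1: 2**(n-1) x n."""
    return (2 ** (n - 1), n)


if __name__ == "__main__":
    print(pipeline_processors(16))        # processors in Figure 4.2
    print(chip_matrix(2), chip_matrix(4))
```

For P = 16 this gives the 32 processors of Figure 4.2, and n = 2 and n = 4 give the 2 x 2 and 8 x 4 organizations discussed above.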

This number of processors may prove to be impractical or simply too
expensive, or we may not need the ultimate throughput achievable by the pipeline
architecture, yet still need more throughput than that provided by a uniprocessor.
Also, it may be desirable to adapt an existing pipeline system to compute larger
FFTs without adding processors. In each of these cases, reusing processors in the
computation enables the designer to trade system throughput for reduced design
complexity and cost. How are data exchanged among processors in a reuse architecture?

The computations shown in Figure 4.2 still must be performed, but now
instead of each block representing an actual processor, it represents a job that some
processor will have to perform. For example, consider a 16-point FFT performed with
two 4-processor ICs. Figure 4.5 shows the data exchange for this example if the
four-processor ICs are organized as 4 x 1 vectors.

As shown in Table V, even though a chip processes eight points every frame,
it transmits only four points per frame, keeping half its data on board for further
processing with the half it will receive from the other IC. This assumes that some
on-chip communication path exists to enable processors to exchange data.

Figure 4.1 Sixteen Point Fast Fourier Transform [Ref. 3: pp. 2019-22].

Figure 4.2 Sixteen Point Fast Fourier Transform Pipeline Implementation.

Figure 4.3 Sixteen Point Fast Fourier Transform Performed by 4 x 1 Chips.

Figure 4.5 Sixteen Point Fast Fourier Transform 4 x 1 Reuse Architecture.
TABLE V

INTER-PROCESSOR COMMUNICATIONS
4 x 1 REUSE ARCHITECTURE

Stage   Chip A                               Chip B
#       Receives          Transmits          Receives          Transmits

1       f0 f1 f2 f3       a4 a5 a6 a7        f4 f5 f6 f7       a8 a9 a10 a11
        f8 f9 f10 f11                        f12 f13 f14 f15

2       a8 a9 a10 a11     b4 b5 b6 b7        a4 a5 a6 a7       b8 b9 b10 b11

3       b8 b9 b10 b11     c1 c3 c5 c7        b4 b5 b6 b7       c8 c10 c12 c14

4       c8 c10 c12 c14    F0 F8 F4 F12       c1 c3 c5 c7       F1 F9 F5 F13
                          F2 F10 F6 F14                        F3 F11 F7 F15

TABLE VI

INTER-PROCESSOR COMMUNICATIONS
2 x 2 REUSE ARCHITECTURE

Stage   Chip A                               Chip B
#       Receives          Transmits          Receives          Transmits

1       f0 f8 f4 f12      none               f1 f9 f5 f13      none

2       f2 f10 f6 f14     b1 b3 b9 b11       f3 f11 f7 f15     b4 b12 b6 b14

3       b4 b6 b12 b14     F0 F8 F4 F12       b1 b3 b9 b11      F1 F9 F5 F13
                          F2 F10 F6 F14                        F3 F11 F7 F15

As  an  alternative,  consider  the  same  16-point  FFT  computed  by  two 
4-processor  ICs,  this  time  organized  as  2  x  2  matrices,  as  shown  in  Figure  4.6  and 
Table  VI. 

Because each IC is only two processors "wide," a single IC can only accept
four data points at a time. This creates an awkward data flow: the source delivers only
half the input vector, waits, then delivers the other half. Each chip must store the
output of its first computation while processing the second half of the input vector.
However, the number of data points exchanged between chips is sharply reduced, from
24 for the 4 x 1 case to 8 for the 2 x 2 case.

Figure 4.6 Sixteen Point Fast Fourier Transform 2 x 2 Reuse Architecture.

TABLE VII

Stage   Chip A                               Chip B
#       Receives          Transmits          Receives          Transmits

1       f0 f8 f4 f12      b1 b3 b9 b11       f1 f9 f5 f13      b4 b12 b6 b14
        f2 f10 f6 f14                        f3 f11 f7 f15

2       b4 b6 b12 b14     F0 F8 F4 F12       b1 b3 b9 b11      F1 F9 F5 F13
                          F2 F10 F6 F14                        F3 F11 F7 F15


A third organization of these four processors permits transmission of the
entire data vector (as in the 4 x 1 chip) and minimizes the data exchange (as in the
2 x 2 chip). Its structure is shown in Figure 4.7 and Table VII.

This structure, possible only if processors are reused, maximizes the "width" of
the chip while preserving the communication advantages of a "deep" chip. As
discussed in the previous section, these advantages stem from performing all the
calculations possible on a given data set before releasing it to another chip. By not
allowing "partially chewed" data off the chip, the number of data words to be exchanged
between chips at each stage is minimized. In general, an N-processor chip with this
reuse architecture can perform a 2N-point FFT if organized as an N x 1 vector which
performs 1 + log2(N) stages.

3.  Interleaving  Data  Sets 

The efficiency and throughput of any of these reuse architectures can be
improved through interleaving data sets, that is, delivering new data to the processors
to work on while they wait for the communications link to recycle their intermediate
outputs back to their inputs. Consider the progress of a 16-point FFT calculation
performed by eight processors organized as in Figure 4.7. The processor wait time is
clearly evident in Figure 4.8, in which the data sets are not interleaved. In this
example, the throughput is one FFT per (4Tcalc + Txfr).

Figure 4.7 Sixteen Point Fast Fourier Transform Modified 4 x 1 Reuse Architecture.

Figure 4.8 Sixteen Point Fast Fourier Transform Modified 4 x 1 Reuse Architecture,
Non-Interleaved.

Figure 4.9 Sixteen Point Fast Fourier Transform Modified 4 x 1 Reuse Architecture,
Interleaved.

In Figure 4.9, however, a new data set is delivered to the processors while they
wait for the results of the first phase of the calculation to be recirculated. In this
interleaved case, processors are never allowed to be idle. For this example, throughput
is 2 FFTs per 8Tcalc, or 1 per 4Tcalc, slightly higher than in the non-interleaved case.
This improvement in throughput was achieved without an increase in bus speed;
alternatively, one could reduce bus speed requirements without lowering throughput by
incorporating an interleaved reuse architecture.
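The two throughput figures quoted for this example can be compared numerically. The sketch below simply evaluates them; the function names are illustrative, and the interleaved figure assumes, as in the example, that the transfer time is fully hidden behind computation on the second data set.

```python
def non_interleaved_rate(t_calc, t_xfr):
    """FFTs per unit time when processors idle during recirculation."""
    return 1.0 / (4 * t_calc + t_xfr)


def interleaved_rate(t_calc):
    """FFTs per unit time with two data sets interleaved (2 FFTs per 8 Tcalc)."""
    return 2.0 / (8 * t_calc)


if __name__ == "__main__":
    t_calc = 1.0
    for t_xfr in (0.25, 0.5, 1.0):
        gain = interleaved_rate(t_calc) / non_interleaved_rate(t_calc, t_xfr)
        print(f"Txfr = {t_xfr}: interleaving gains a factor of {gain:.3f}")
```

The gain is (4Tcalc + Txfr) / 4Tcalc, so it is small when the transfer time is short relative to the calculation time, which is consistent with the "slightly higher" result above.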

B.  DATA  DISTRIBUTION 

Data delivery to the processors can be accomplished in several ways:

• processing elements all receive the same data in broadcast fashion

• all processors "know" when it's their turn to receive data and they query the [...]

• data words are "tagged" with their destination; the RBIU reads the tag and delivers
data words to their intended processor

• processors contend for bus access with each other

• RBIU delivers data to processors in a preset schedule.

Only  this  last  scheme  (using  a  preset  schedule)  promises  to  have  sufficient  speed 
to  be  acceptable  for  use  with  the  OM.  But  is  it  possible  to  use  an  a  priori  schedule, 
and  what  would  it  look  like? 

1.  Pipeline  Architecture 

Returning to the example of an FFT computer built of N-processor ICs, Figure
4.4 shows the data exchanges required by a sixteen-point FFT if a pipeline architecture
is used.

The sequence of data on the bus is essentially arbitrary. In choosing the
sequence, it is reasonable to avoid sequences which deliver several data words to the
same BIU one right after the other, in order to minimize the speed required of the BIU.
Figure 4.10 shows one suitable choice.

Due  to  the  regular  structure  of  the  FFT,  there  is  a  simple  algorithm  to 
calculate  the  address  of  any  data  word's  destination,  based  only  on  its  position  in  the 
data  stream,  as  shown  in  Table  VIII.  Because  of  this,  the  RBIU's  data  distribution 
logic  can  be  implemented  with  little  more  than  a  binary  counter.  The  transmission 
algorithm  is  equally  uncomplicated,  as  shown  in  Table  IX. 

The fact that inter-stage data exchange patterns in the FFT computation are
regular and easily implemented in hardware lends further support to the use of preset
schedules to control BIU data distribution.

Figure 4.10 Sixteen Point Fast Fourier Transform Distributed Among Four
Multi-Processor Chips.

TABLE VIII

PRESET SCHEDULE FOR DATA DISTRIBUTION
PIPELINE ARCHITECTURE

Word Sequence     Data     Destination Address
                  Word     (RBIU Buffer)
W3 W2 W1 W0                D3 D2 D1 D0

0  0  0  0        f0       0  0  0  0
0  0  0  1        f5       1  0  1  0
0  0  1  0        f8       0  0  0  1
0  0  1  1        f13      1  0  1  1
0  1  0  0        f2       0  1  0  0
0  1  0  1        f7       1  1  1  0
0  1  1  0        f10      0  1  0  1
0  1  1  1        f15      1  1  1  1
1  0  0  0        f4       0  0  1  0
1  0  0  1        f1       1  0  0  0
1  0  1  0        f12      0  0  1  1
1  0  1  1        f9       1  0  0  1
1  1  0  0        f6       0  1  1  0
1  1  0  1        f3       1  1  0  0
1  1  1  0        f14      0  1  1  1
1  1  1  1        f11      1  1  0  1

D3 = W0
D2 = W2
D1 = W0 ⊕ W3
D0 = W1
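The claim that the distribution logic needs little more than a binary counter can be illustrated with a short sketch of the Table VIII relations (D3 = W0, D2 = W2, D1 = W0 XOR W3, D0 = W1). The data-word order is taken from Table VIII as read here. The comparison against a bit-reversed index is an observation about those values, not a statement made in the source, and all function names are illustrative.

```python
def bit_reverse(value, width=4):
    """Reverse the order of the low `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def pipeline_destination(count):
    """RBIU buffer address for the word at position `count` in the stream."""
    w3, w2, w1, w0 = (count >> 3) & 1, (count >> 2) & 1, (count >> 1) & 1, count & 1
    d3, d2, d1, d0 = w0, w2, w0 ^ w3, w1
    return (d3 << 3) | (d2 << 2) | (d1 << 1) | d0


# Index k of the data word f_k sent at each counter value (Table VIII).
DATA_ORDER = [0, 5, 8, 13, 2, 7, 10, 15, 4, 1, 12, 9, 6, 3, 14, 11]

if __name__ == "__main__":
    for count, k in enumerate(DATA_ORDER):
        print(f"W={count:04b}  f{k:<2d} -> buffer {pipeline_destination(count):04b}")
```

With these values each word f_k lands in the buffer whose address is the bit reversal of k, which is the input ordering usually associated with a radix-2 FFT.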

2.  Reuse  Architecture 

Tables X and XI and Figure 4.11 show the data flow structure required to
compute a 16-point FFT with a reuse architecture. Although the task is accomplished
with fewer processors than in Figure 4.10, there are three additional complications:

• additional buffers directly connect processors which must exchange data in
intermediate stages of the calculation

• an internal path exists between TBIU and RBIU to allow processors which are
not directly connected to exchange data

• BIUs must coordinate the use of internal and external paths.

TABLE IX

PRESET SCHEDULE FOR DATA DISTRIBUTION
PIPELINE ARCHITECTURE

Word Sequence     Data     Source Address
                  Word     (TBIU Buffer)
W3 W2 W1 W0                S3 S2 S1 S0

0  0  0  0        b0       0  0  0  0
0  0  0  1        b5       1  0  0  1
0  0  1  0        b8       0  1  0  0
0  0  1  1        b13      1  1  0  1
0  1  0  0        b2       0  0  1  0
0  1  0  1        b7       1  0  1  1
0  1  1  0        b10      0  1  1  0
0  1  1  1        b15      1  1  1  1
1  0  0  0        b4       1  0  0  0
1  0  0  1        b1       0  0  0  1
1  0  1  0        b12      1  1  0  0
1  0  1  1        b9       0  1  0  1
1  1  0  0        b6       1  0  1  0
1  1  0  1        b3       0  0  1  1
1  1  1  0        b14      1  1  1  0
1  1  1  1        b11      0  1  1  1

S3 = W0 ⊕ W3
S2 = W1
S1 = W2
S0 = W0
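The transmission schedule of Table IX can be sketched in the same way as the receiving schedule. The relations S3 = W0 XOR W3, S2 = W1 and S1 = W2 are taken from the table. The relation S0 = W0 is an assumption, inferred from the alternating last column of the table because the equation itself is not legible in the source. The function name is illustrative.

```python
def pipeline_source(count):
    """TBIU buffer address read at position `count` of the outgoing stream."""
    w3, w2, w1, w0 = (count >> 3) & 1, (count >> 2) & 1, (count >> 1) & 1, count & 1
    s3, s2, s1 = w0 ^ w3, w1, w2
    s0 = w0  # assumed: inferred from the table's last column, not from an equation
    return (s3 << 3) | (s2 << 2) | (s1 << 1) | s0


if __name__ == "__main__":
    schedule = [pipeline_source(count) for count in range(16)]
    print(schedule)
    # Every buffer must be read exactly once per frame.
    print("reads each buffer once:", sorted(schedule) == list(range(16)))
```

Under this assumption the sixteen counter values address sixteen distinct buffers, so the schedule drains every output buffer exactly once per frame.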


C.  RECEIVER  TASKS 

We  can  view  the  data  distribution  circuitry  as  being  separated  into  a  Receiving 
Bus  Interface  Unit  (RBIU)  and  a  Transmitting  Bus  Interface  Unit  (TBIU).  The  RBIU 
must: 

•  capture  data  from  the  high-speed  bus 

•  convert  data  from  serial  to  parallel  format 

•  perform  error  detection/correction 

•  deliver  the  data  word  to  its  destination  processor. 

Figure 4.12 shows the architecture developed in this project to accomplish these
tasks. It may be noted that this architecture uses a separately distributed clock signal.
This scheme was used to simplify the construction and testing of a system prototype,
but once past this phase the clock could be embedded in the data stream itself (as in
Manchester coding), eliminating the need for a separate clock line. Alternatively, if a
fiber optic data link were used, the clock could be sent on the same fiber as the data,
but at a different carrier frequency (color), allowing clock recovery independently of
data reception.

TABLE X

PRESET SCHEDULE FOR DATA DISTRIBUTION
REUSE ARCHITECTURE

Word Sequence     Data     Destination Address
                  Word     (RBIU Buffer)
W3 W2 W1 W0                D3 D2 D1 D0

0  0  0  0        f0       0  0  0  0
0  0  0  1        f1       1  0  0  0
0  0  1  0        f8       0  0  0  1
0  0  1  1        f9       1  0  0  1
0  1  0  0        f4       0  0  1  0
0  1  0  1        f5       1  0  1  0
0  1  1  0        f12      0  0  1  1
0  1  1  1        f13      1  0  1  1
1  0  0  0        f2       0  1  0  0
1  0  0  1        f3       1  1  0  0
1  0  1  0        f10      0  1  0  1
1  0  1  1        f11      1  1  0  1
1  1  0  0        f6       0  1  1  0
1  1  0  1        f7       1  1  1  0
1  1  1  0        f14      0  1  1  1
1  1  1  1        f15      1  1  1  1

D3 = W0
D2 = W3
D1 = W2
D0 = W1
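For the reuse architecture, the address relations of Tables X and XI, as read here, amount to one-bit rotations of the word counter: D3 D2 D1 D0 = W0 W3 W2 W1 is a rotation to the right, and S3 S2 S1 S0 = W2 W1 W0 W3 is a rotation to the left. The sketch below implements them that way. The rotation view and the function names are illustrative rather than taken from the source.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1


def reuse_destination(count):
    """RBIU buffer address (Table X): counter bits rotated right by one."""
    return ((count >> 1) | ((count & 1) << (WIDTH - 1))) & MASK


def reuse_source(count):
    """TBIU buffer address (Table XI): counter bits rotated left by one."""
    return ((count << 1) | (count >> (WIDTH - 1))) & MASK


if __name__ == "__main__":
    for count in range(16):
        print(f"W={count:04b}  D={reuse_destination(count):04b}  "
              f"S={reuse_source(count):04b}")
```

Because a rotation is only a rewiring of the counter outputs, this sketch suggests the reuse schedules need no gates beyond the counter itself, whereas the pipeline schedule of Table VIII needs a single exclusive-OR.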

TABLE XI

PRESET SCHEDULE FOR DATA DISTRIBUTION
REUSE ARCHITECTURE

Word Sequence     Data     Source Address
                  Word     (TBIU Buffer)
W3 W2 W1 W0                S3 S2 S1 S0

0  0  0  0        b0       0  0  0  0
0  0  0  1        b1       0  0  1  0
0  0  1  0        b8       0  1  0  0
0  0  1  1        b9       0  1  1  0
0  1  0  0        b4       1  0  0  0
0  1  0  1        b5       1  0  1  0
0  1  1  0        b12      1  1  0  0
0  1  1  1        b13      1  1  1  0
1  0  0  0        b2       0  0  0  1
1  0  0  1        b3       0  0  1  1
1  0  1  0        b10      0  1  0  1
1  0  1  1        b11      0  1  1  1
1  1  0  0        b6       1  0  0  1
1  1  0  1        b7       1  0  1  1
1  1  1  0        b14      1  1  0  1
1  1  1  1        b15      1  1  1  1

S3 = W2
S2 = W1
S1 = W0
S0 = W3

The control signals shown in Figure 4.12 also deserve some discussion. The
RBIU circuitry develops these signals as a function of the bit count, then distributes
the signals depending on which word is currently being received. These signals control
the First-In-First-Out (FIFO) stacks which buffer data between the RBIU and the
Processing Elements (P/Es), as well as between the P/Es and the TBIU. These FIFOs
require signals to cause them to:

• load a new data word (from the RBIU)

• output the next word (to the TBIU)

• advance the stack to bring up the next output word (now that the TBIU has the
current word)
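The three FIFO control signals listed above can be modelled in software as three separate operations. The class below is a behavioural sketch only: the class and method names are hypothetical, and it says nothing about how the FIFOs in Figure 4.12 are actually built.

```python
from collections import deque


class WordFifo:
    """Behavioural model of a FIFO buffer driven by three control signals."""

    def __init__(self):
        self._words = deque()

    def load(self, word):
        """LOAD: accept a new data word at the tail of the stack."""
        self._words.append(word)

    def output(self):
        """OUTPUT: present the word at the head without removing it."""
        return self._words[0] if self._words else None

    def advance(self):
        """ADVANCE: drop the head so the next word becomes the output."""
        if self._words:
            self._words.popleft()


if __name__ == "__main__":
    fifo = WordFifo()
    for word in (0x1234, 0xABCD):
        fifo.load(word)
    print(hex(fifo.output()))   # head word is presented
    fifo.advance()              # the reader has taken it, so bring up the next
    print(hex(fifo.output()))
```

Keeping "output" and "advance" as separate operations mirrors the list above, where the stack advances only once the TBIU has taken the current word.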

D.  TRANSMITTER  TASKS 

The transmission part of the data distribution circuitry must:

• take the data word from its source processor

• convert data from parallel to serial format

• add error detection bits

• insert data onto the high-speed bus

Figure 4.11 Sixteen Point Fast Fourier Transform Distributed Between Two Chips
Using a Reuse Architecture.

Figure 4.12 Receiving Bus Interface Unit Architecture.

Figure 4.13 Transmitting Bus Interface Unit Architecture.

Figure 4.13 shows the architecture developed in this project to accomplish these
tasks. The control and timing circuitry needed to interface the output FIFOs with the
TBIU is included as part of the RBIU diagram.
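The transmitter and receiver tasks are mirror images of one another, which a short sketch can make explicit. The source does not specify the word width or the error-detection code, so the 16-bit word and the single even-parity bit used below are assumptions chosen only for illustration, as are the function names.

```python
WORD_BITS = 16  # assumed word width; not specified in the source


def serialize(word, width=WORD_BITS):
    """TBIU side: parallel word -> serial bit list (MSB first) plus a parity bit."""
    bits = [(word >> i) & 1 for i in range(width - 1, -1, -1)]
    bits.append(sum(bits) % 2)          # even parity over the data bits (assumed code)
    return bits


def deserialize(bits):
    """RBIU side: check parity, then rebuild the parallel word."""
    *data, parity = bits
    if sum(data) % 2 != parity:
        raise ValueError("parity error detected in received word")
    word = 0
    for bit in data:
        word = (word << 1) | bit
    return word


if __name__ == "__main__":
    frame = serialize(0xBEEF)
    print(len(frame), "bits on the bus")
    print(hex(deserialize(frame)))
```

A single parity bit detects any odd number of bit errors but corrects none; a design that needed the error correction mentioned among the receiver tasks would require a stronger code.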

E.  CONCLUSIONS  AND  LIMITATIONS  OF  THIS  RESEARCH 
1.  Conclusions 

The high-speed serial communication provided by the Optoelectronic
Multiplexer makes possible a shared-bus parallel processing architecture for problems
like the FFT where the data distribution schedule can be determined a priori. The data
distribution algorithms for the FFT are quite simple and can be realized with little
more than a binary counter.

For the FFT, processor groupings on chip should correspond to the 2^{n-1} x n
matrices inherent in the FFT calculation in order to minimize the amount of inter-chip
communications.

Trends in actual processor data suggest that the throughput of a processor
in most cases is proportional to (size of the processor)^X, where X < 1 and "size"
refers to both transistor count and power dissipation. This implies that, for a given
chip size, dividing the chip into increasing numbers of smaller processors raises the
number of processors faster than it lowers the throughput of an individual processor.
Thus, for most types of processors studied, the greatest throughput is achieved by
organizing a large chip as a bank of many simple processors.
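The scaling argument can be written out explicitly. If a chip of total size S is divided evenly among N processors and each processor's throughput is proportional to (S/N)^X, the aggregate throughput is proportional to N(S/N)^X = S^X N^(1-X), which grows with N whenever X < 1. The sketch below evaluates this. The proportionality constant is taken as 1, even division of the chip and perfect use of every processor are assumed, and the function name is illustrative.

```python
def bank_throughput(total_size, n_processors, x):
    """Aggregate throughput of n equal processors sharing `total_size`.

    Each processor is assumed to deliver (its size)**x, in arbitrary units.
    """
    per_processor = (total_size / n_processors) ** x
    return n_processors * per_processor


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, "processors:", bank_throughput(1.0, n, x=0.5))
```

With X = 0.5, quadrupling the number of processors doubles the aggregate figure. With X = 1 the division would bring no gain, which is why the conclusion depends on X being less than 1.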

Finally, a single-chip OM-based parallel processor is feasible since:

• Manufacturers can fabricate sufficient transistors on a single chip to construct
many simple processors.

• A chip composed of only about 12 simple processors, easily achieved with current
fabrication technology, could produce enough throughput in a highly structured
problem (like the FFT) to justify the use of the OM's high capacity.

• Constructing such a chip in a conventional package using one pin or lead per bit
would require an excessive package size.

2.  Limitations  and  Recommendations 

The  architecture  described  in  this  report  was  designed  with  only  the  FFT  in 
mind.  It  may  not  be  adaptable  to  less  structured  calculations  or  to  systems  which 
must  perform  a  wide  variety  of  calculations. 

Multiple-processor chip performance was predicted based on a limited
sampling of current processor data. Further research, using a comprehensive study of
actual processor performance, is needed to augment the simple model developed here.

The comparison of conventional leaded packages and serially multiplexed
packages considered only the extremes of one pin per bit and one pin per chip.
Additional study of alternatives between these endpoints is needed to determine at
what point the cost (in terms of dollars, chip area, and heat) of the Optoelectronic
Multiplexer is justified by its higher performance.